ABSTRACT

Three item response models were evaluated for
estimating item parameters and equating test scores. The models,
which approximated the traditional three-parameter model, included:
(1) the Rasch one-parameter model, operationalized in the BICAL
computer program; (2) an approximate three-parameter logistic model
based on coarse group data divided into fifths and twentieths, and
using the Quantile modification of the LOGIST program; and (3) a
modified three-parameter logistic model with fixed a's and c's, using
the LOGIST computer program. The data came from a study of the
Scholastic Aptitude Test (SAT), which involved the chain equating of
a test to itself through five intermediary forms; approximately 2,670
cases were used for each SAT form. Results showed that item
calibrations based on twentieths were closer to the true values and
to LOGIST estimates than those based on fifths, but the equating
results based on twentieths were not more accurate. Method (2)
yielded highly accurate score conversions in equating a test to
itself, and all three models yielded very accurate equating results.
Questions were raised about the adequacy of equating a test to itself
as a criterion for evaluating equating results. Further research was
recommended before adopting any of the approximate models. Twelve
tables and 22 figures are appended. (Author/GDC)

An Evaluation of Three Approximate Item Response Theory
Models for Equating Test Scores

Abstract 

The primary purpose of this study was to determine the extent to which
three item response theory (IRT) models could be used to approximate the
three-parameter logistic model in estimating item parameters and in equating
test scores. These approximate models were less expensive to apply and in some
cases used less data than the full-blown three-parameter model.

The approximations to the three-parameter model used in this study were
(1) the Rasch one-parameter model, as operationalized in the BICAL computer
program, (2) an approximate three-parameter logistic model based on grouped data
divided into fifths and twentieths, and (3) a modified three-parameter logistic
model with fixed a's and c's. The LOGIST computer program was used to estimate
parameters for the modified three-parameter model; Quantile, a modified version
of LOGIST that accepted coarsely grouped data, was used to estimate item
parameters for the approximate three-parameter model.

In the case of the approximate models involving BICAL and LOGIST, results
of separate item calibrations were used to place item parameter estimates on
the same scale. In the case of the approximate model involving Quantile, a
method of scaling the item parameter estimates indirectly through existing
SAT scaled scores was used.

The data for the study came from a recent study (Petersen, Cook, & Stocking,
1983) of scale stability for the Scholastic Aptitude Test. As in the previous
study, this study involved the chain equating of a test to itself through
five intermediary forms. The sample consisted of approximately 2,670 cases
for each of the SAT forms used.

The results of the study were as follows: (1) the item calibrations
based on twentieths were closer to the true values and to LOGIST estimates
than item calibrations based on fifths; (2) the equating results based on
twentieths, however, were not more accurate generally than those based on
fifths; (3) the three-parameter model using coarse groupings yielded highly
accurate score conversions in equating a test to itself, more accurate in
fact than the full-blown three-parameter models studied by Petersen, Cook,
and Stocking; and (4) all of the approximate models yielded very accurate
equating results. A follow-up analysis indicated that these unexpected
equating results were due in large part to the indirect method used to place
item parameter estimates on scale through existing score conversions derived
from conventional equating methods. The success of the approximate models
raises a question about the adequacy of equating a test to itself as a
criterion for evaluating equating results. Further research is recommended
before any of the approximate models are used operationally.



An Evaluation of Three Approximate Item Response Theory
Models for Equating Test Scores



The increasing internal and external demands made on testing programs have
underscored the inflexibility of score equating methods used traditionally.
Item response theory (IRT) equating offers several advantages in this context,
including improved equating (particularly at the ends of the scale), greater
test security through less dependence on items common to a particular previously
used form, and easier re-equating when items are added or deleted. While these
are important advantages, test disclosure legislation has created a more urgent
need for IRT-based equating. The New York State test disclosure legislation
requires that those items on which reported scores are based be made available
to the public. An important advantage of using IRT methods in response
to such legislation is that equating based on item pretest data is possible
prior to a test's administration (pre-equating), thus permitting forms to be
developed without requiring a special equating administration.

Although IRT methods of equating are superior to traditional methods in a
number of important respects (see Marco, Petersen, & Stewart, 1983), the costs
of converting from traditional to IRT equating methods can be substantial. The
LOGIST computer program (Wingersky, 1983) and other computer programs used to
estimate IRT item parameters for the three-parameter logistic test model take a
considerable amount of computer time and thus are expensive to run for large
data sets. The costs are particularly high when IRT methods are introduced into
an existing testing program because the parameters of a large number of items
must not only be estimated using a program like LOGIST but also placed on a
common scale through complicated common-item linkages.

Previous research suggests that approximate IRT methods might be useful
when the objective is to equate test scores. In a study of PSAT/NMSQT
pre-equating, Marco (1977) used an approximate method for placing item parameter
estimates on a common scale to avoid the considerable expense of calibrating
items from a large number of test forms. He used existing score equating
results based on traditional linear equating to scale item parameter estimates
from separate applications of LOGIST. Marco found that, except at the upper end
of the score scale, the pre-equating results agreed reasonably well with the
criterion equatings. Different item calibration techniques have also been
compared. In a simulation study Ree (1979) compared item parameter and ability
estimates obtained from LOGIST and two of Urry's programs, ANCILLES and OGIVIA,
with the parameters from which the simulated data had been generated. He found
none of the programs uniformly superior for parameter estimation. However, the
cost of using Urry's programs was only 10% to 15% of the cost of using
LOGIST.

Several studies have evaluated the Rasch model, the simplest IRT model, for
equating test scores. Rentz and Bashaw (1975) equated scores on a number of
elementary school reading tests with the Rasch model. They found good agreement
between the equating results of the equipercentile and Rasch models. In another
study Douglass (1980) found that the Rasch model provided better equating results
than the two-parameter logistic model for a classroom achievement testing system.
The Rasch equatings were more consistent across different-sized examinee samples.
There was also evidence that, compared to two-parameter equatings, Rasch
equatings tended to result in less equating error when dissimilar examinee samples
were used. In a large scale study of score equating methods, Marco, Petersen,
and Stewart (1983) compared one-parameter and three-parameter logistic,
equipercentile, and linear score equating models under varying conditions. They found
that when a test was equated to itself using random samples, all of the equating
models had a small amount of equating error. But when dissimilar samples were
used, both the one- and three-parameter logistic models were clearly superior.
However, when a test was equated to a test differing in difficulty, the equating
results of the one-parameter model were unsatisfactory. In another study
of score equating models, Kolen (1981) compared linear, equipercentile, and
one-, two-, and three-parameter logistic score equating models. Like Marco,
Petersen, and Stewart, he found that the one-parameter logistic model yielded
inadequate results for equating tests of unequal difficulty. Other studies
(e.g., Slinde & Linn, 1978; Loyd & Hoover, 1980; and Holmes, 1982) have also
evaluated the adequacy of the Rasch model for score equating, with mixed results.

These studies from the IRT research literature support the possible utility 
of using approximate methods, but also call attention to conditions under which 
approximate methods might give unsatisfactory results. For a test that has 
little form-to-form variation and only moderate differences in the ability of
the examinees from one administration to another, there is good reason to expect 
that approximate methods might provide acceptable results at a much lower cost. 
Of course, approximate methods would be most useful in small testing programs, 
which cannot afford to use the more expensive methods. 

The primary purpose of this study was to determine the extent to which the
many advantages of IRT score equating could be realized by using approximate
models that were less expensive to apply and, in some cases, required less
data than the full-blown three-parameter logistic model, as operationalized
in the LOGIST computer program. Three IRT equating models intended to
approximate the three-parameter logistic equating model were studied. Various
hypotheses concerning these models were formulated. These hypotheses are
outlined in the first part of the section on results.

Procedures 

The tests and examinee samples for this study were those used for a recent
Scholastic Aptitude Test (SAT) scale stability study (Petersen, Cook, & Stocking,
1983). That study investigated several methods for equating scores from six
SAT-verbal and six SAT-mathematical test forms. Included among the methods were
linear equating, equipercentile equating, and several variants of IRT equating.
The study involved the chain equating of a test to itself through five
intermediary forms. The current study builds on these results by providing data on a
number of additional equating methods intended to approximate three-parameter
logistic equating.

Tests and Test Scores 

The tests consisted of six operational and six equating tests for SAT-verbal
and SAT-mathematical, respectively, administered between December 1973 and May
1979. The tests were chosen so that the equatings formed a closed circle in
that a test form could be equated to itself. These tests are identified in
Figure 1, which shows the chains of six verbal and six mathematical equatings
that were used in the study. SAT forms are indicated by upper case letters and
equating tests by lower case letters. Each SAT-verbal form except Form V4,
which was administered prior to the fall of 1974, had 85 items; Form V4 contained
90 items. A given verbal equating test contained 40 items. Each SAT-mathematical
form except Form Y3 had 60 items, and each mathematical equating test except
Form fn had 25 items. Due to a printing error, SAT-mathematical Form Y3 had 59
items, and mathematical equating test fn had 24 items.



Insert Figure 1 about here 



The SAT was shortened from 75 minutes to 60 minutes in the fall of 1974 to
permit the administration of the SAT's companion test, the Test of Standard
Written English. The shorter SAT-verbal forms contained the same item types as
the previous forms, but the numbers of items (all five-choice) within a given
item type were changed. The shorter SAT-mathematical form contained quantitative
comparisons (four-choice) and regular mathematics (five-choice) items instead of
data sufficiency (five-choice) and regular mathematics (five-choice) items.

Raw scores on the SAT are formula scores based on the number right minus a
fraction of the number wrong, where the fraction is 1/(no. of response options - 1).
Raw scores for a particular test form are converted to scaled scores on the 200
to 800 College Board scale by applying the mathematical transformation derived
through score equating.
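
The formula-score rule described above can be sketched in a few lines of Python. The function name and the example counts are hypothetical and are not taken from the study; the sketch only restates the rule "number right minus number wrong divided by (number of response options - 1)", with omitted items carrying no penalty.

```python
def formula_score(num_right, num_wrong, num_options):
    """Raw formula score: rights minus wrongs / (options - 1).

    Omitted and not-reached items are neither rewarded nor penalized.
    """
    return num_right - num_wrong / (num_options - 1)


# Hypothetical examinee on a five-choice test: 50 right, 20 wrong, 15 omitted.
example_score = formula_score(50, 20, 5)  # 50 - 20/4
```

For a test that mixes four- and five-choice items, the same function would be applied to each item type separately and the two results summed.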

Data Used in the Study 

The sample consisted of approximately 2,670 cases for each pairing of an
SAT form and an equating test shown in Figure 1. The actual sample sizes ranged
from 2,527 to 2,879. The samples were randomly selected from examinees taking
the SAT at the respective administrations. Figure 2 shows the data sets that
were used in the study. Individual records contained item response data
appropriate for use in the various computer programs, which required information on
right and wrong responses to each test item. Records also contained information
on items omitted and items not reached. Table 1 gives the SAT scaled score
means and standard deviations for the samples used in the study.



Insert Figure 2 and Table 1 about here 



Equating Design 

Figure 1 shows the chains of six verbal and six mathematical equatings that
were used in the study. These chains were also used in the SAT scale stability
study. In that study and in this one, SAT-verbal Form V4 was equated to itself
through several intermediary forms. Form V4 was treated as the base form of the
test for equating scores on Form Z5 to scores on Form V4. Form Z5 in turn was
treated as the base form for equating scores on Form Y2 to scores on Form Z5.
In the last step scores on Form V4 were equated to scores on Form X2 using Form
X2 as the base. The results of this chain equating could be compared to the
original scores. Ideally, the results would be identical. Any discrepancy
could be attributed to the particular equating method used. In the 1981 study
all equatings made use of common item linkages established by the equating
tests. In the current study some equatings depended upon the equating test
data, and some did not.

The idea of equating a test to itself as a way of evaluating equating
methods was introduced by Levine (1955) when he developed several linear true
score equating methods. Marco, Petersen, and Stewart (1983), in their study of
curvilinear equating methods, also used this type of criterion. This idea was
extended by Petersen, Cook, and Stocking (1983) to chain equating, whereby a
test is equated to itself through a series of intermediate forms. In this way
variations in test length, test difficulty, etc., can be introduced to discover
to what extent various equating models can adapt to changing conditions.

The reason that equating a test to itself is such a powerful idea is that
when a test is equated to another test, the true relationship of the scores is
not known. Thus, when studying equating in a natural setting, equating a test
to itself is the only way to ensure the availability of a known criterion. When
a test is equated to a different test, simulations can be used to establish a
known criterion, but then it is difficult to introduce the kind of variation
that exists naturally.

In the equating chain used for this study and the SAT scale stability
study, SAT forms differed systematically only in that Form V4 was administered
before the time limit for SAT-verbal or -mathematical was changed from 75
minutes to 60 minutes. This decrease in time limits necessitated a change
in the mixture of item types in SAT-verbal and the introduction of Quantitative
Comparison items in SAT-mathematical. These changes plus the natural variation
from form to form probably introduced some curvilinearity into the equating
relationship, and some differences in reliability could be expected from the
changes in test lengths.

Equating Models 

There are three separate but related steps required to use item response
theory for score equating. The first step, item calibration, is to estimate
item parameters. The second step, item parameter transformation - required when
item parameters are estimated in separate computer runs - is to place item
parameters on a common scale. The third step, score equating, is to relate raw
scores on various pairs of tests to underlying abilities. In IRT true score
equating, the only equating method used in this study, scores on two tests are
considered to be equated if and only if the true scores correspond to the same
underlying ability level. However, various item calibration and item parameter
transformation procedures were used. The type of data set (data from two SAT
forms and an equating test or data from one SAT form) also varied, depending on
the equating method.

The approximations to the three-parameter logistic model used in this
study were (1) the Rasch one-parameter logistic model, (2) an approximate
three-parameter logistic model based on grouped data divided into fifths and
twentieths, and (3) a modified three-parameter logistic model with fixed a's and c's.
The one-parameter model was included in the study for comparative purposes
because of its relative simplicity and its wide use in some professional circles.
The BICAL computer program was used to estimate item parameters for the
one-parameter model; and LOGIST, for the modified three-parameter model. Quantile,
a modified version of LOGIST written for this study, was used to estimate item
parameters for the approximate three-parameter model.

If item parameters are estimated in separate computer runs, the scales
underlying the estimates will, according to the theory, differ by a linear
transformation. Thus, before such estimates can be used for score equating,
they must be transformed to a common scale. This can be accomplished in several
ways. In this study item parameters were placed on a common scale (1) by
calibrating concurrently items from the pairs of test forms whose scores were to
be equated and their common equating tests; (2) by equating the b's, the item
difficulties, using parameter estimates for the equating test items from separate
item calibrations; and (3) by equating $\theta$'s, examinee abilities, indirectly using
existing operational score equating parameters. The third procedure had been
used by Marco (1977) to equate item parameter estimates from different samples
when there is no equating test.

Table 2 shows the variations associated with these three approximate
equating models. The sample sizes were the same for the three models -
approximately 2,670 cases for each sample. Also, the same type of equating was used in
each case; namely, IRT true score equating. This kind of equating is described
at the end of this section. The approximate equating models varied as to the
type of data used, the method of item calibration, and the method of item
parameter equating. The various data sets used in the study have already been
identified in Figure 2.



Insert Table 2 about here



One-parameter logistic (Rasch) model. The computer program BICAL (Wright
& Mead, Note 2) was used to calibrate the items from the 12 verbal and 12
mathematical data sets shown in Figure 2. A separate application of BICAL was
made for each data set. Since BICAL provides item parameters on a separate
scale for each item calibration, the item parameter estimates had to be
transformed. This was accomplished by setting equal the means from the two
calibrations of the common items (Wright & Stone, 1979). (For this method an additive
constant provides the appropriate adjustment to the b's.) For example, the item
difficulty parameter estimates for SAT-verbal Form X2 and Equating Test fe were
equated to the scale for SAT-verbal Form V4 and Equating Test fe by subtracting
.090. This constant was found by subtracting the mean b for the items in the
equating test fe for the examinees who took Form X2 from the mean b for the same
items for the examinees who took Form V4. Table 3 gives the equating
transformations used to place BICAL estimates on the Form V4 scale. The transformation
for placing the item parameter estimates for any particular form on the V4 scale
was obtained simply by summing the constants in the chain. In this way, a
transformation was obtained for equating the item parameters for Form V4 to the
V4 scale indirectly. Ideally the constant would equal 0. This final
transformation is given in the last line of the table.
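
As a minimal sketch of the mean-shift linking just described, the Python below computes an additive constant from two calibrations of the same common items and sums constants along a chain. The function names and all difficulty values are hypothetical illustrations, not BICAL output or values from Table 3.

```python
from statistics import mean


def link_constant(b_common_base, b_common_new):
    """Additive constant that moves the new calibration onto the base scale.

    Mean difficulty of the common items in the base calibration minus
    their mean difficulty in the new calibration.
    """
    return mean(b_common_base) - mean(b_common_new)


def chain_constant(constants):
    """Constant for a form several links away: the sum along the chain."""
    return sum(constants)


# Hypothetical difficulties for the same three equating-test items,
# calibrated once with each form's examinees.
b_common_with_base_form = [-0.50, 0.10, 0.70]
b_common_with_new_form = [-0.40, 0.20, 0.80]

shift = link_constant(b_common_with_base_form, b_common_with_new_form)

# Hypothetical new-form item difficulties moved onto the base scale.
b_new_form_on_base_scale = [b + shift for b in [0.00, 1.00]]
```

In a closed chain that returns to the starting form, the summed constant would ideally be 0, which is the check the last line of Table 3 provides.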



Insert Table 3 about here 



Approximate three-parameter logistic model. This model was intended to
approximate the three-parameter logistic model using grouped data. Previously,
Bock (1976) had used coarse grouping in the computer program LOGOG to estimate
item parameters and had obtained relatively accurate, albeit inconsistent,
estimates with large samples. Considerable cost savings could result if the
abilities for each examinee did not have to be estimated. This model was
designed to calibrate items on the basis of item analysis information routinely
produced at Educational Testing Service.

A new computer program, Quantile, was developed by modifying LOGIST to
accept grouped data. The Quantile version estimates the item parameters for a
test using item responses of groups of examinees instead of responses of
individual examinees. The examinees are divided into groups before the program is
applied. For each item the input to the program is the number of examinees in
each group who answered the item correctly plus a fraction (1/(no. of response
options)) of the number who omitted the item, the number of examinees in each
group who reached the item, the total number of examinees who answered the item
correctly, and the total number of examinees who omitted the item. All of the
examinees in a group are treated as having the same ability. This ability is
estimated using maximum likelihood in the same manner as individual abilities
are estimated by LOGIST. The options available and the output are identical to
the LOGIST options and output.
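
To make the grouped input concrete, the following Python sketch builds the per-group counts described above for a single item. It is an illustration only: the response codes, function names, and the rank-based grouping rule are assumptions, and the sketch does not reproduce the actual Quantile or Anytiles input formats.

```python
def assign_groups(total_scores, n_groups):
    """Assign each examinee to one of n_groups score groups by rank
    (n_groups=5 gives fifths, n_groups=20 gives twentieths)."""
    order = sorted(range(len(total_scores)), key=lambda i: total_scores[i])
    groups = [0] * len(total_scores)
    for rank, examinee in enumerate(order):
        groups[examinee] = rank * n_groups // len(total_scores)
    return groups


def grouped_item_counts(responses, groups, n_groups, num_options):
    """Per-group input for one item.

    responses uses hypothetical codes: 1 correct, 0 wrong,
    "O" omitted, "N" not reached.
    Returns (adjusted number correct, number reached) for each group,
    where an omit counts as 1/num_options of a correct answer.
    """
    adjusted_right = [0.0] * n_groups
    reached = [0] * n_groups
    for resp, g in zip(responses, groups):
        if resp == "N":
            continue  # not reached: excluded from the group's count
        reached[g] += 1
        if resp == 1:
            adjusted_right[g] += 1.0
        elif resp == "O":
            adjusted_right[g] += 1.0 / num_options
    return adjusted_right, reached


# Four hypothetical examinees in two score groups answering a five-choice item.
adj, n_reached = grouped_item_counts([1, 0, "O", "N"], [0, 0, 1, 1], 2, 5)
```

The overall totals of correct answers and omits that the program also requires can be obtained by summing over examinees in the same pass.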

All of the required information can be derived from routine ETS item 
analysis data from groupings based on fifths. For purposes of the study the 
special program Anytiles was written to allow groupings into both fifths and 
twentieths. The latter grouping was used for comparison purposes even though 
not produced by routine item analysis. 

Quantile produces estimates of all three item parameters for the logistic
test model. Before the program was used in the study, it was tested on
artificial data. These test runs indicated that the a's and c's were underestimated
when compared with their true values. When the c's were fixed at their true
values, however, the a's were unbiased. (Previous comparisons of LOGIST results
with true values from artificial data had demonstrated that LOGIST item parameter
estimates based on individual data are unbiased when based on large samples.)

To correct for this bias, an empirical correction was computed using
the item calibrations for the SAT-verbal and SAT-mathematical data from the
March, May, and June 1982 administrations for which LOGIST item calibrations
existed. Separate corrections for coarse groupings (fifths and twentieths)
were computed for verbal items (five-choice), four-choice mathematical items,
and five-choice mathematical items. These empirical corrections were derived
in the following way: (1) The b parameter estimates from Quantile were equated
to the b parameters from LOGIST to place them on the same scale. (Means and
standard deviations were set equal.) (2) The a parameter estimates were
equated to the a's from LOGIST by setting means and standard deviations equal
after removing any pairs where either a was greater than 1.5. (3) Step 2 was
repeated for the c's, removing any items with either c at the common value or
c greater than .4. For the c's that were not estimated but set equal to the
common c, the mean c was compared to the mean c from LOGIST for these same
items and a constant adjustment obtained. (4) Results for the March, May, and
June data sets were compared and average linear transformations obtained.
Table 4 gives the average corrections applied in the study.
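
Setting means and standard deviations equal, as in steps (1) and (2) above, amounts to a linear transformation. The Python sketch below shows that computation under stated assumptions: the function names are hypothetical, population standard deviations are used (the report does not say which form was used), and the example values are invented rather than taken from Table 4.

```python
from statistics import mean, pstdev


def mean_sigma(x_from, x_to):
    """Slope and intercept such that slope * x_from + intercept has the
    same mean and standard deviation as x_to."""
    slope = pstdev(x_to) / pstdev(x_from)
    intercept = mean(x_to) - slope * mean(x_from)
    return slope, intercept


def a_correction(a_grouped, a_individual, cutoff=1.5):
    """Mean-sigma correction for the a's, dropping any item pair in which
    either estimate exceeds the cutoff (1.5 in step 2)."""
    kept = [(g, i) for g, i in zip(a_grouped, a_individual)
            if g <= cutoff and i <= cutoff]
    return mean_sigma([g for g, _ in kept], [i for _, i in kept])


# Hypothetical estimates: grouped-data values run low relative to the
# individual-data values, and the last pair is screened out by the cutoff.
slope, intercept = a_correction([0.4, 0.6, 0.8, 1.2], [0.6, 0.9, 1.2, 1.9])
```

Averaging the slopes and intercepts obtained from several administrations, as in step (4), would then give a single correction per item type and grouping.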



Insert Table 4 about here 



The Quantile program was applied to item response data from the SAT forms;
equating test data were not used. SAT-verbal data from the following data sets
shown in Figure 2 were used: V4 and fe, X2 and fm, Y3 and fw, B3 and fk, Y2 and
fu, Z5 and et, and V4 and et. The SAT-mathematical data sets consisted of data
from V4 and ff, X2 and fn, Y3 and fu, B3 and fl, Y2 and fv, Z5 and eu, and V4
and eu.

Once these item calibrations were available, they were adjusted by applying
the appropriate empirical corrections (see Table 4). The parameter estimates,
both corrected and uncorrected, then had to be transformed to a common scale.
This was accomplished using the operational score equating parameters in the
manner described by Marco (1977). The essential steps were as follows:

1. For a given ability level, $\theta_i$, for test i, compute the true number-
right score, $R_i$, by $R_i = \sum_g P_g(\theta_i)$, where $P_g(\theta_i)$, the
probability of answering item g on test i correctly, is computed from the
estimated parameters for item g.

2. Express $R_i$ as a true formula score, $FS_i$, under the assumption that all
items are answered: $FS = R - (N_5 - R_5)/4$ for SAT-verbal (five-choice
items) and $FS = R - (N_4 - R_4)/3 - (N_5 - R_5)/4$ for SAT-mathematical
(four- and five-choice items), where N is the number of test items and
the subscript indicates whether four-choice (4) or five-choice (5)
items are involved.

3. Transform $FS_i$ to the College Board scale (S) using the operational
scaling parameters derived previously when the tests were originally
equated (see Table 5):

$$S = A_i \, FS_i + B_i .$$

4. Find the true $FS_j$ on test j corresponding to this scaled score:

$$FS_j = (S - B_j)/A_j ,$$

where $A_j$ and $B_j$ are the scaling parameters (see Table 5).

5. Compute the true $R_j$ from the formulas in Step 2.

6. Determine the corresponding ability level $\theta_j$ for test j by finding the
$\theta$ for which $\sum_g P_g(\theta)$ equals $R_j$ using the item parameter
estimates for test j.

7. Apply steps 1-6 to approximately 60 equally spaced ability levels
between -3 and 3.

8. Determine the straight line relating the $\theta_i$'s to the $\theta_j$'s in the range
-1.75 to 1.75 by setting their means and standard deviations equal.
The range is restricted to prevent outliers in the tails of the score
range from influencing the results. This process results in a
transformation $\theta_i = A\theta_j + B$. Here $A = SD(\theta_i)/SD(\theta_j)$ and
$B = M(\theta_i) - A\,M(\theta_j)$, where M and SD stand for mean and standard
deviation, respectively.

9. Determine the transformation for placing the item parameter estimates
for test j onto the scale of the item parameter estimates for test i
as follows:

$$b_j^* = A b_j + B \quad \text{and} \quad a_j^* = a_j / A .$$

The c's are unaffected by scale transformations.
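
The nine steps above can be sketched in Python as follows. This is an illustrative reading of the procedure, not the program used in the study: it handles only five-choice items, solves for ability by bisection directly on the true formula score (which combines steps 5 and 6), and applies the -1.75 to 1.75 restriction to both sets of abilities, which is an assumption. All function names and the `(slope, intercept)` layout of the score conversions are hypothetical.

```python
import math
from statistics import mean, pstdev


def p3pl(theta, a, b, c):
    """Three-parameter logistic probability of a correct response."""
    return c + (1.0 - c) / (1.0 + math.exp(-1.702 * a * (theta - b)))


def true_formula_score(theta, items, k=5):
    """Steps 1-2: true number right, then true formula score (k-choice items)."""
    n = len(items)
    r = sum(p3pl(theta, a, b, c) for a, b, c in items)
    return r - (n - r) / (k - 1)


def solve_theta(target_fs, items, lo=-8.0, hi=8.0):
    """Steps 5-6: ability whose true formula score equals target_fs.
    Bisection works because the true score increases with theta."""
    for _ in range(100):
        mid = (lo + hi) / 2.0
        if true_formula_score(mid, items) < target_fs:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0


def theta_link(items_i, conv_i, items_j, conv_j, n_points=60):
    """Steps 3-8: slope A and intercept B with theta_i = A * theta_j + B.
    conv_i and conv_j are (slope, intercept) raw-to-scale conversions."""
    (slope_i, icpt_i), (slope_j, icpt_j) = conv_i, conv_j
    kept_i, kept_j = [], []
    for n in range(n_points):
        th_i = -3.0 + 6.0 * n / (n_points - 1)                          # step 7
        scaled = slope_i * true_formula_score(th_i, items_i) + icpt_i  # step 3
        fs_j = (scaled - icpt_j) / slope_j                              # step 4
        th_j = solve_theta(fs_j, items_j)
        if abs(th_i) <= 1.75 and abs(th_j) <= 1.75:                     # step 8
            kept_i.append(th_i)
            kept_j.append(th_j)
    big_a = pstdev(kept_i) / pstdev(kept_j)
    big_b = mean(kept_i) - big_a * mean(kept_j)
    return big_a, big_b


def rescale_items(items_j, big_a, big_b):
    """Step 9: put test j's (a, b, c) estimates on test i's scale."""
    return [(a / big_a, big_a * b + big_b, c) for a, b, c in items_j]
```

A simple sanity check for a sketch like this is to build test j as test i expressed on a deliberately shifted and stretched ability scale, give both tests the same score conversion, and confirm that the recovered A and B undo the shift.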



Insert Table 5 about here 



Table 6 gives the final transformations to the Form V4 scale for item
difficulties estimated from the Quantile computer program. Corrected transformations
were not determined for SAT-mathematical, for the SAT-verbal results (see the
section on results) indicated that the corrections did not improve the equating
transformations. Ideally, the transformation from Form V4 to the Form V4 scale
would be (1.0 × b) + 0; that is, the original value would be returned.

Once the item parameter estimates were transformed to the V4 scale, the
scores on Form V4 could be equated to themselves. For this purpose only the two
sets of V4 item parameters estimated were utilized; the use of intermediate
estimates was unnecessary.



Insert Table 6 about here 



Modified three-parameter logistic model. A reasonable alternative to the
three-parameter logistic model is a modified version that involves fixing the
item discriminations at a common value and fixing the lower asymptotes at a
common non-zero value for all items. This model has two potential advantages
over the one-parameter logistic model discussed previously. First, for multiple-
choice items a lower asymptote greater than zero is a more reasonable assumption
than a lower asymptote of zero. Second, the LOGIST program takes omitted and
not reached items into account, whereas the BICAL program considers omitted and
not reached items as incorrect responses.

LOGIST was used to estimate item parameters for the modified three-parameter
model. Item parameters a and c were fixed at .788 and .149, respectively, for
SAT-verbal and at .898 and .113 (for five-choice items) or .155 (for four-choice
items), respectively, for SAT-mathematical. The fixed values were averages from
previous SAT item calibrations from LOGIST.

Each item calibration involved two samples of examinees, one which took an
SAT and an equating test and one which took a different SAT form but the same
equating test. For example, data from examinees who took SAT Form V4 and
Equating Test fe were merged with the data from examinees who took SAT Form X2
and Equating Test fe for calibration purposes (see Figure 2). The LOGIST
computer program permits this type of calibration, even though all examinees do
not answer all items, and returns item parameter estimates on the same scale for
both SAT forms and the equating test. Thus, there is no need to derive a
separate transformation to equate the item parameters from the two forms.

Score equating was accomplished by equating successively the scores of the
forms represented in the concurrent item calibrations. Thus, transforming item
parameter estimates to a common scale was unnecessary. The scores for Form X2
were equated to the scores for Form V4 using the item calibration involving
these two forms, and raw-to-scaled-score conversions were obtained. Then Form
X2 scores transformed to the V4 scale were used as input for the equating of
scores on Form Y3 to scores on Form X2. This process was continued until the
chain ended with the equating of scores on Form V4 to scores on Form Z5. These
final raw-to-scaled-score conversions for Form V4 could then be compared with the
original conversions.

Score Equating Method: Curvilinear True Score Equating

Once appropriate item parameter estimates were available for the approximate
methods, score equating could be accomplished. IRT true score equating was used
in each instance. This method was also used in the SAT scale stability study.
Lord (1980) discussed this kind of equating in his recent book. Here only a
brief summary of the method is given.

The first step in the score equating process is to compute the true number-
right score from the item parameters that have been placed on a common scale.
The true number-right score is a function of the ability level $\theta$ and the
item parameters $a_g$, $b_g$, and $c_g$. For the one-parameter logistic model
$a_g = 1/1.702$ and $c_g = 0$ for every item. (The division by 1.702 is
necessary to make the results of BICAL consistent with the results of LOGIST.)
Let R stand for the number-right true score. Then
$R(\theta) = \sum_g P_g(\theta)$, where

    $P_g(\theta) = c_g + (1 - c_g) / (1 + \exp(-1.702\, a_g (\theta - b_g)))$.

The true formula score is obtained by assuming that everyone answered all items,
so that

    $FS = R - (N - R)/(k - 1)$,

where N is the number of test items and k is the number of response options for
the item g. In the case of SAT-mathematical, which contained both four- and
five-choice items, two correction terms were used: one for four-choice items
and one for five-choice items.
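
As a rough illustration of the two formulas above, the following Python sketch computes the true number-right and true formula scores. It is a minimal sketch under stated assumptions: the function names are hypothetical, the correction is applied item by item so that mixed four- and five-choice tests are covered, and the example b values are made up (only the fixed a = .788 and c = .149 come from the text).

```python
import math


def p_correct(theta, a, b, c):
    """Three-parameter logistic probability of a correct response."""
    return c + (1.0 - c) / (1.0 + math.exp(-1.702 * a * (theta - b)))


def true_number_right(theta, items):
    """Sum of item probabilities; items holds (a, b, c, k) tuples."""
    return sum(p_correct(theta, a, b, c) for a, b, c, _k in items)


def true_formula_score(theta, items):
    """Item-by-item form of FS = R - (N - R)/(k - 1).

    Summing P - (1 - P)/(k - 1) over items equals the formula in the
    text when every item has the same k, and also handles mixed k.
    """
    total = 0.0
    for a, b, c, k in items:
        p = p_correct(theta, a, b, c)
        total += p - (1.0 - p) / (k - 1)
    return total


# Hypothetical five-choice items: a and c are the fixed SAT-verbal values
# quoted in the text; the b values are invented for illustration.
items = [(0.788, b, 0.149, 5) for b in (-1.0, 0.0, 1.0)]
score = true_formula_score(0.5, items)
```

For the one-parameter case the same functions apply with a = 1/1.702 and c = 0 for every item.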

True formula scores on two tests are said to be equated if they are
functions of the same $\theta$. To obtain the raw-to-scaled score conversions for
raw scores, it is necessary only to determine $\theta$ for each FS on the test form
and to find the corresponding FS on the other form. In practice, of course,
estimates of the item parameters rather than the unknown true parameters are
used for calculating the true scores.
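
The two-step lookup described here (find the ability that yields a given formula score on one form, then evaluate the other form at that ability) can be sketched in Python as follows. This is an illustrative sketch only: the bisection search, the search interval, the function names, and the example item parameters are assumptions, not the procedure or values used in the study.

```python
import math


def formula_score(theta, items):
    """True formula score; items holds (a, b, c, k) tuples."""
    total = 0.0
    for a, b, c, k in items:
        p = c + (1.0 - c) / (1.0 + math.exp(-1.702 * a * (theta - b)))
        total += p - (1.0 - p) / (k - 1)
    return total


def theta_for_score(fs, items, lo=-8.0, hi=8.0, tol=1e-8):
    """Bisection search; formula_score increases with theta when a > 0."""
    if not formula_score(lo, items) <= fs <= formula_score(hi, items):
        raise ValueError("score is not attainable on the search interval")
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if formula_score(mid, items) < fs:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def equate(fs_x, items_x, items_y):
    """Formula score on form Y at the ability that gives fs_x on form X."""
    return formula_score(theta_for_score(fs_x, items_x), items_y)


# Hypothetical forms: form Y is form X with every item made harder.
items_x = [(0.788, b, 0.149, 5) for b in (-1.0, 0.0, 1.0)]
items_y = [(0.788, b + 0.5, 0.149, 5) for b in (-1.0, 0.0, 1.0)]
```

Formula scores at or below the chance level have no corresponding ability under the model, which is why the sketch raises an error for scores outside the attainable range rather than extrapolating.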

For the one-parameter logistic model and the approximate three-parameter
logistic model, equating Form V4 to itself was accomplished in one step once the
item parameters were transformed. For the modified three-parameter model, the
results of the item calibrations were applied stepwise, starting with the
equating of raw scores on Form V4 to raw scores on Form X2 (see Figure 1), and
continuing around the circle by feeding in score conversions from the previous
equating. At the end of the chain, initial raw scores on Form V4 were
transformed to scaled scores by applying the original scaling parameters for
Form V4 (see Table 5) to the equated raw scores corresponding to these initial
raw scores. These "final" scaled scores could then be compared to the initial
scaled scores.



A number of specific hypotheses were formulated for the study and expected
to be confirmed: (a) The item parameter estimates from the groups based on
twentieths more closely match the three-parameter logistic estimates from LOGIST
than the estimates based on fifths. (b) The item parameter estimates corrected
for coarse grouping more closely match the three-parameter logistic estimates
from LOGIST than the uncorrected estimates. (c) The approximate three-parameter
logistic equating model using item parameter estimates from the groupings based
on twentieths yields less equating error than the model using estimates based on
fifths. (d) The approximate three-parameter logistic equating model using item
parameter estimates corrected for coarse grouping yields less equating error
than the model using uncorrected estimates. (e) The approximate equating models
yield more equating error than the concurrent equating model. (f) The more
complex approximate model - the modified three-parameter model - yields less
equating error than the other approximate models. (g) Because it utilizes only
item difficulty parameters, the one-parameter logistic model (Rasch) yields more
equating error than any of the other approximate models.

In this set of hypotheses reference was made to the concurrent equating
model, which was evaluated in the SAT scale stability study (Petersen, Cook,
& Stocking, 1983), and, beginning in January 1982, is being used operationally to
equate SAT scores. In concurrent IRT equating, item parameter transformation is
unnecessary as a separate step. Items from a new form, an old form, and a
common anchor test are calibrated together using LOGIST, which produces item
parameter estimates on a common scale. Then new form scores are equated to old
form scores on the basis of these item parameter estimates. For the next
equating the new form becomes the old form. Items from this form, another new
form, and another common anchor test are calibrated together, and the scores on
the total tests are equated. This kind of sequential "pairwise" equating was
judged the most adequate of the three-parameter logistic IRT equating models
represented in the SAT scale stability study. The results for concurrent
equating referenced in this report are taken from that study.

Comparisons of Item Parameter Estimates 

The item parameter estimates from the Quantile computer program were
evaluated for a set of artificial data on 45 items and 1,500 examinees for which
the true parameter values are known and LOGIST results already exist. The
Quantile item parameter estimates were compared to the true values and to the
LOGIST estimates. Since for those data LOGIST parameter estimates are known to
have negligible bias, the LOGIST results can be used as a criterion for
determining bias in the Quantile estimates where the true item parameters are
unknown. The abilities for each set of parameter estimates were standardized to
a mean of 0 and standard deviation of 1 for abilities between -3 and 3, and the
item parameters adjusted accordingly to put all of the parameters on a common
metric.
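
One way to read the standardization step is as a linear change of the ability scale that leaves every item response function unchanged. The Python sketch below illustrates this under stated assumptions: the function name is hypothetical, the mean and standard deviation are taken over abilities inside [-3, 3] only, and a population (not sample) standard deviation is used, which the text does not specify.

```python
import statistics


def standardize(thetas, items, lo=-3.0, hi=3.0):
    """Rescale abilities and (a, b, c) items to a common metric.

    The mean m and standard deviation s come from abilities in [lo, hi].
    Then theta* = (theta - m)/s, b* = (b - m)/s, a* = a*s, and c is
    unchanged, so a*(theta - b) is the same before and after.
    """
    inside = [t for t in thetas if lo <= t <= hi]
    m = statistics.fmean(inside)
    s = statistics.pstdev(inside)  # assumption: population SD
    new_thetas = [(t - m) / s for t in thetas]
    new_items = [(a * s, (b - m) / s, c) for a, b, c in items]
    return new_thetas, new_items


# Hypothetical abilities and a single hypothetical item.
thetas = [-1.0, 0.0, 2.0, 5.0]
items = [(0.9, 0.4, 0.2)]
new_thetas, new_items = standardize(thetas, items)
```

Because the product a(theta - b) is preserved, parameter sets from different programs can be compared after this rescaling without altering the fitted curves.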

The plots comparing the Quantile data to the true values are shown in Figure 3
for the fifths and Figure 4 for the twentieths. The circles on the plots
indicate items for which c was set to a common c value by the computer programs.
The c's and the larger a's are underestimated. This same bias is evident in the
plots in Figures 5 and 6, comparing the Quantile estimates to the LOGIST
estimates for the fifths and the twentieths, respectively.



Table 7 gives the summary statistics for the comparisons for the artificial
data. The "mean absolute difference between the item response functions" is
the absolute difference between the two curves averaged over all of the
examinees and over all items. This is highest for the LOGIST results compared to
true values. Asymptotically, LOGIST minimizes the weighted mean squared error,
not the mean absolute error. For all of the other statistics the LOGIST results
agree better with the true values than do the Quantile results. The hypothesis
that the twentieths give better estimates than the fifths is confirmed. The
item-ability regressions in Figure 7 show the effects of using grouped data.



Insert Figures 4, 5, and 6 about here

For Item 21 and most other items, the estimated curve fit the true curve very
well. For Item 28 and a few other items, however, poor fit resulted.



Insert Table 7 and Figure 7 about here 



For the two sets of data on the Form V4 scale, Quantile results were compared
to the LOGIST results using the uncorrected item parameter estimates and the
estimates corrected for bias. The summary statistics are given in Table 8.
Figures 8 to 11 show the comparisons for SAT Form V4 and Equating Test fe for
uncorrected and corrected item parameter estimates, respectively. Figures 12 to
15 show the comparisons for SAT Form V4 and Equating Test et. The hypothesis
that the item parameter estimates corrected for coarse grouping more closely
match the LOGIST results than the uncorrected estimates was confirmed only for
the b's and c's of Form V4-et. For the a's and c's of SAT Form V4 and Equating
Test fe, the uncorrected estimates agree better with the LOGIST results than do
the corrected estimates, and for the b's the corrections had negligible effect.



Insert Table 8 and Figures 8-15 about here 



Comparisons of Equating Results - Descriptions of the Tables and Figures

The equating results from the various approximate equating models are given
in Table 9 and Figures 16 and 17 for SAT-verbal and in Table 10 and Figure 18
for SAT-mathematical. Tables 9 and 10 give the point-by-point conversions. In
these tables the initial scaled score (the criterion) is the scaled score that
was obtained from the original equating. The final scaled score is the scaled
score that was obtained by applying the original raw-to-scaled score conversion
parameters for Form V4 to the equated raw scores resulting from the chain
equating. The results are directly comparable with those obtained in the SAT
scale stability study.



Insert Tables 9 and 10 and Figures 16, 17, and 18 about here

Figures 16, 17, and 18 present the equating results graphically. These
graphs show the differences between the model results (final scaled scores) and
the criterion (initial scaled scores). The codes used in the figures are as
follows:

CRIT:    Criterion - initial scaled score,
CONCUR:  Concurrent (three-parameter logistic),
BICAL:   BICAL (one-parameter logistic),
W05:     Approximate three-parameter logistic based on fifths without
         corrections to item parameter estimates,
W5:      Approximate three-parameter logistic based on fifths with
         corrections to item parameter estimates,
W020:    Approximate three-parameter logistic based on twentieths
         without corrections to item parameter estimates,
W20:     Approximate three-parameter logistic based on twentieths with
         corrections to item parameter estimates, and
MOD3:    Modified three-parameter logistic.

Table 11 summarizes the point-by-point results in terms of several
discrepancy indices. It gives the means and standard deviations for the scaled
scores resulting from the various equatings and for the criterion scores.
Ideally, the mean and standard deviation for a particular model would correspond
exactly to the mean and standard deviation of the criterion scores.



Insert Table 11 about here 



The table also gives the weighted mean squared difference and its two
components, the mean difference and the standard deviation of the difference.
For each raw score x on Form V4 there are final scaled scores resulting from the
various equatings and an initial scaled score. The smaller the differences
between the final score, $t'_j$, for a particular equating model and the initial
score, $t_j$, the more accurate the equating model is. To compute the weighted
mean squared differences the values of x are weighted according to their actual
occurrence in some reference group. The weighted mean squared difference is
equal to the variance of the difference plus the mean difference squared; that
is,

    $\sum_j f_j d_j^2 / n = \sum_j f_j (d_j - \bar{d})^2 / n + \bar{d}^2$,

where $d_j = (t'_j - t_j)$, $t'_j$ is the estimated scale score for raw score
$x_j$, $t_j$ is the initial or criterion scale score for $x_j$, $f_j$ is the
frequency of $x_j$, $n = \sum_j f_j$, $\bar{d} = \sum_j f_j d_j / n$, and the
summation is over that range of x observed across samples. The values in
Table 11 were computed from the data in Tables 9 and 10, summing over verbal raw
scores 1 to 80 and mathematical raw scores -8 to 55 using the corresponding
frequencies for the total group taking Form V4 when it was first administered in
December 1973. The results are directly comparable with those in the SAT scale
stability study.
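
The decomposition of the weighted mean squared difference into a variance term and a squared mean difference can be checked numerically. The Python sketch below is illustrative only; the function name and the three score points with their frequencies are hypothetical and are not data from Tables 9 to 11.

```python
def weighted_msd(final, initial, freq):
    """Return (weighted MSD, mean difference, SD of difference).

    final and initial are scaled scores at each raw score point and
    freq holds the reference-group frequency of each raw score.
    """
    n = sum(freq)
    d = [tf - ti for tf, ti in zip(final, initial)]
    mean_d = sum(f * x for f, x in zip(freq, d)) / n
    var_d = sum(f * (x - mean_d) ** 2 for f, x in zip(freq, d)) / n
    msd = sum(f * x * x for f, x in zip(freq, d)) / n
    return msd, mean_d, var_d ** 0.5


# Hypothetical final scores, criterion scores, and frequencies.
msd, mean_d, sd_d = weighted_msd([500, 410, 305], [498, 412, 300], [10, 30, 5])
```

The returned values satisfy msd = mean_d squared plus sd_d squared up to rounding, which is the identity stated in the text.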

Figure 19 depicts the weighted mean squared difference in terms of its two
components. The curved lines in the figure represent four levels of weighted
mean squared error: 25, 100, 225, and 400. At every point on a given line, the
squared standard deviation of the difference plus the squared mean difference
equals that line's level. The equating models are represented by numbers in the
case of verbal equatings and by letters in the case of mathematical equatings.
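
As a brief worked note on the four contour levels: if the two axes of the figure are the mean difference and the standard deviation of the difference drawn on equal scales (an assumption about the plot, which is not reproduced here), each contour is a circular arc centered at the origin. The symbols $\bar{d}$ and $s_d$ below are labels introduced for this note.

```latex
\[
  \bar{d}^{\,2} + s_d^{2} = \mathrm{MSE}
  \quad\Longrightarrow\quad
  \sqrt{\bar{d}^{\,2} + s_d^{2}} = \sqrt{\mathrm{MSE}},
\]
\[
  \sqrt{25} = 5, \qquad \sqrt{100} = 10, \qquad
  \sqrt{225} = 15, \qquad \sqrt{400} = 20 .
\]
```

Read this way, the four lines correspond to root mean squared differences of 5, 10, 15, and 20 scaled-score points.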



Insert Figure 19 about here 



Comparisons of Equating Results - Variations of the Approximate Three-Parameter 
Model 

The two primary variants of the approximate three-parameter logistic
equating model were based on two groupings of examinees: fifths and twentieths.
It was expected that the item parameter estimates based on twentieths would be
more accurate and that the equating results would also be more accurate. It is
clear from the analysis of the item parameter estimates that the estimates based
on twentieths were more accurate, but Figures 16 and 17 show that this increased
accuracy carried over only slightly to the equating results. In fact, in some
parts of the score range the results based on fifths were more accurate; and, in
the case of SAT-mathematical, the results based on fifths were superior to those
based on twentieths (see Figure 17 and Table 11). The total mean squared error,
however, was small for the approximate three-parameter logistic equating models.
Thus, the differences in the results are of little practical significance. The
fact that there were only small differences between the models suggests that
coarse groupings for obtaining item parameter estimates may be adequate for some
equating purposes.

Corrections for coarse grouping were used for the two variations of the
approximate three-parameter logistic equating model. The corrected item
parameter estimates were applied to the SAT-verbal data, with the intention of
using corrected SAT-mathematical estimates if the verbal results indicated that
the corrections were useful. The previous comparison of corrected and
uncorrected estimates has already indicated that the corrected estimates were
less accurate than the uncorrected estimates for SAT-verbal. Nevertheless, the
equating results based on corrected estimates were expected to be more accurate
than those based on uncorrected estimates. It is clear from Figure 17 that the
corrections had little effect on the equating results. If anything, overall
equating accuracy decreased (see Table 11). There are several possible
explanations as to why the corrections were not effective. First, the Quantile
computer program produced relatively accurate estimates of the item parameters,
as has already been discussed. Any corrections might simply have added noise to
the estimates. Second, the corrections were determined empirically on the basis
of only a few data sets and thus might not have been very reliable. Given these
results, mathematical results based on corrected item parameters were not
obtained.

Neither of the hypotheses regarding the accuracy of equating for approximate
three-parameter logistic equating models was confirmed. Making corrections to
item parameter estimates and using more than five groupings may be unnecessary
when an approximate three-parameter logistic equating model is used with tests
that have little form-to-form variation. This does not mean, however, that the
item parameter estimates from Quantile can be used along with those from LOGIST.
It is important that the same method be used for estimating item parameters
prior to score equating. The possibility of using estimates from different
computer programs was beyond the scope of this study and deserves further
investigation.

Comparisons of Equating Results - All Three Approximate Models 

Figures 16 and 18 and Table 11 give the equating results for the three 
approximate equating models and their variations along with the results from the 
concurrent equating model from the SAT scale stability study. Contrary to 
expectation, the concurrent equating model did not yield the smallest amount of 
equating error. For both SAT-verbal and SAT-mathematical, concurrent equating 
had the largest amount of total error except for the modified three-parameter 
model. Even in comparison with the modified model, concurrent equating yielded 
more error at the ends of the score range. It is not clear why a more complex
model would yield more error, particularly given the large sample sizes.

Both the approximate three-parameter logistic model and the one-parameter
logistic model performed better than the modified three-parameter model.
Primarily because of a general bias (mean difference), the modified
three-parameter model produced scores that were up to 20 points too high for
SAT-verbal and up to 7 points too high for SAT-mathematical. The approximate
three-parameter logistic model had little equating error for either type of
equating. The one-parameter logistic model, interestingly enough, yielded the
smallest amount of error for equating SAT-mathematical scores but considerably
more error than the approximate models for equating SAT-verbal scores. It is
clear from Figure 19 that bias accounted for much of the mean squared error
associated with the concurrent, BICAL, and modified three-parameter model verbal
equatings and with the modified three-parameter model mathematical equating.

From these results it is clear that none of the hypotheses regarding the
comparisons of the three equating models was confirmed for either verbal or
mathematical equating. This was true despite the fact that the SAT-mathematical
equatings were more accurate than the verbal equatings. It is not obvious why
the mathematical equatings had less error, nor why the one-parameter logistic
model had the smallest amount of equating error for SAT-mathematical.

Perhaps the most surprising finding of all was the performance of the
various approximate equating models, which, because of their simplicity, were
expected to yield more equating error than the concurrent model and the modified
three-parameter model. These models performed exceedingly well. Is it possible
that equating a test to itself, even when it involves a chain of equatings, is
biased in some unknown way? Or are the results really valid for test forms that
are very similar to one another in content, length, and difficulty?

Follow-up Analysis 

Perhaps the most informative research that could have been initiated was to
investigate to what extent the results for the approximate three-parameter
logistic model were due to the method of transforming the item parameters. For
that model existing score equating parameters for converting raw scores to scaled
scores were used to compute the transformation for equating item parameter
estimates. As a follow-up analysis, this same procedure was used with the
one-parameter logistic item parameters from BICAL and with the three-parameter
logistic item parameters from LOGIST to see whether this type of item parameter
transformation is superior to other methods of placing item parameter estimates
on a common scale. It is possible that the variation in the common item sections
used for item calibration, for example, adds error to both the item parameter
estimation and transformation processes. Linking item parameter estimates by
use of existing score equating results, if satisfactory, would greatly simplify
the problem of creating large pools of items with parameters on a common scale.

Item parameter estimates from BICAL and LOGIST were available on all of the 
data sets shown in Figure 2. Previously, the item parameters from LOGIST did 
not have to be transformed to a common scale because the items from a given pair 
of SAT forms and their common equating test were calibrated together. Thus, 
they were automatically on the same scale and did not have to be transformed. 
The item parameters from BICAL, however, had been transformed to a common scale 
through common items by use of the item transformation procedure built into 
BICAL. The follow-up analysis involved using score equating information to
derive transformations for the LOGIST and BICAL item parameter estimates that
existed on the separate data sets used in the Petersen, Cook, and Stocking
(1983) study. These were V4-fe, X2-fm, Y3-fw, B3-fk, Y2-fu, Z5-et, and V4-et
for SAT-verbal and V4-ff, X2-fm, Y3-fx, B3-fl, Y2-fv, Z5-eu, and V4-eu for
SAT-mathematical (see Figure 2).

The method described in the section on the approximate three-parameter 
logistic model was used to derive item parameter transformations. The resulting 
transformations are shown in Table 12 along with the item parameter transfor- 
mations previously derived on common items (equating tests) for BICAL and the 
three-parameter logistic model. The method used to obtain the common-item 
transformations for the latter model was the characteristic curve method 
(Stocking & Lord, 1983).

Insert Table 12 about here



The results of the equating using the item parameter transformations based
on score equating information are shown in Table 13 and Figures 20, 21, and 22.
The codes used in Figures 20, 21, and 22 are as follows:

CRIT:         Criterion - initial scaled score,
CONCUR:       Concurrent (three-parameter logistic),
LOGIST(MOD):  LOGIST (three-parameter logistic) - modified to use
              item parameter transformations derived from score
              equating information,
BICAL(MOD):   BICAL (one-parameter logistic) - modified to use item
              parameter transformations derived from score equating
              information, and
W05:          Approximate three-parameter logistic based on fifths
              without corrections to item parameter estimates.



Insert Table 13 and Figures 20, 21, and 22 about here



The results of the reanalysis show clearly that the use of score equating
information to transform item parameters explains why the approximate
three-parameter models had a small amount of mean squared error. Recall that the
use of score equating information to derive item parameter transformations was
built into the approximate three-parameter procedure. A comparison of Tables 11
and 13 indicates that in the case of SAT-verbal the mean squared error decreased
from 83.35 to 1.69 for the one-parameter model (BICAL vs. BICAL (modified))
and from 125.15 to 6.99 for the three-parameter model (concurrent vs. LOGIST
(modified)). Large decreases in the mean squared error for SAT-mathematical
are also evident for the three-parameter model. One can infer that the use
of item parameter transformations derived from score equating information was
very effective in reducing mean squared error. One can also infer that the
effectiveness of the approximate three-parameter models was due to the use
of score equating information for deriving transformations.

The reduction in mean squared error is such that the differences among the
models utilizing score equating information are slight. However, one of the
approximate models, in this case BICAL or BICAL (modified), was still the best
model for either SAT-verbal or SAT-mathematical.

The use of item parameter transformations based on score equating
information needs further study before such transformations can be applied
operationally in testing programs. In this study the chain was short, and half
of the common-item linkages used in the study were the very same ones that had
been used in score equating. This probably created a situation in which the
conversion parameters were more consistent with item calibration results than
would be expected if items were calibrated for test forms widely separated in
the genealogical chart. One needs to find out how well these transformations
work when items from a variety of old forms, particularly those linked together
by long equating chains, are calibrated. In conclusion, item parameter
transformations based on score equating information look promising, but need
further testing.







Recommendations for Further Research

Because of these unusual results, further research is recommended. One
possibility is to choose base forms other than Form V4 and redo the chain
equating. If the same results were obtained, the findings would be more
generalizable, and chance compensating effects at intermediate steps could be
ruled out as a possible explanation for the results.

One might also create a situation, as Marco, Petersen, and Stewart (1983)
did, in which a test is equated to a different test rather than to itself. One
could equate scores on a particular form to scores on the form that is next to
it in the original chain equating by proceeding both ways around the circle.
For example, scores on Form Z5 could be equated to Form V4 scores by two
different paths. The results from these two equatings should agree if the
equating method is working properly. Unfortunately, this type of equating is not
entirely definitive, because results for two different equatings might agree
well but still not be correct.

In addition, as was suggested in the previous section, further evaluation is
needed of using score equating information to derive item parameter
transformations. In the current study the number of links on the chain was
limited. This could have created a situation that was favorable to using score
equating information for transforming item parameters. The usefulness of the
method should be evaluated in situations where score equating is relatively
independent of the forms used in the experimental chain.

Finally, and perhaps most important, designs for evaluating equating should
be studied under simulated conditions where the correct results are known.
Ideally, the simulated conditions should not be based on any of the models for
equating. Some useful information could, however, be derived from studies using
the three-parameter logistic model to generate the data, although other
models should also be used to generate data. The use of simulated data would
allow one to evaluate any bias that may be created when a test is equated to
itself. It would also allow one to evaluate the various equating models in
situations in which a test is equated to a different test. It is critical
that the usefulness of the current design be evaluated so that decision-makers
have a better basis for choosing equating models.

The Air Force Human Resources Laboratory has recently issued a report that
reviews various methods of equating mental tests, including IRT models (Gialluca,
Crichton, & Vale, 1984). This report and the IRT studies cited, plus a review
of other research that has been conducted at ETS and elsewhere, should guide
future research activities.

A preliminary version of this report was presented at the meeting of the American
Psychological Association in 1982 (Marco, Douglass, & Wingersky, Note 1).
Table 1

Summary Statistics for SAT-Verbal and SAT-Mathematical Equating Samples

                          Verbal                           Mathematical
               -----------------------------     -----------------------------
       Admin.  Equating        Scaled Score^a    Equating        Scaled Score
Form   Date    Test        N    Mean    SD       Test        N    Mean    SD

V4     12/73   fe       2665    438    114       ff       2628    457    115
X2      4/75   fe       2686    437    106       ff       2629    476    111
X2      4/75   fm       2562    432    106       fn       2527    471    111
Y3      6/76   fm       2578    426    112       fn       2553    465    113
Y3      1/78   fw       2549    405    109       fx       2455    443    117
B3      5/79   fw       2700    433    108       fx       2633    479    114
B3      5/79   fk       2665    429    104       fl       2596    476    111
Y2      4/76   fk       2879    432    108       fl       2815    469    115
Y2      4/76   fu       2774    428    105       fv       2721    472    115
Z5     12/77   fu       2853    414    108       fv       2774    447    114
Z5     12/77   et       2814    417    110       eu       2739    444    113
V4     12/73   et       2670    436    113       eu       2673    455    115

^a Scaled score statistics are linear transformations of raw score
statistics and deviate slightly from reported score statistics in those
cases where curvilinear transformations were used operationally.
Table 2

A Description of Three Approximate
IRT Equating Models

                   One-Parameter            Approximate Three-       Modified Three-
Dimension          Logistic (Rasch)         Parameter Logistic       Parameter Logistic

Type of Data       One data set for each    One data set for each    One data set for each
                   pair of equating tests   SAT form                 pair of equating tests
                   and SAT forms                                     and SAT forms

Method of Item     BICAL                    Quantile (modified       LOGIST
Calibration                                 LOGIST)

Method of Item     Equating of b's          Equating of θ's          Concurrent Calibration
Parameter
Transformation



Table 3

Transformations for Equating Item Difficulties
Estimated from Different BICAL Item Calibrations

Scale
Relationship      Verbal                Mathematical

X2 to V4          [illegible] .090      b - .131
Y3 to V4          b - .174              b - .226
B3 to V4          b - .214              b - .182
Y2 to V4          b - .130              b - .189
Z5 to V4          b - .130              b - .116
V4 to V4          b - .122              b - .014
Table 4

Corrections Applied to Quantile
Estimates of Item Parameters

Parameter           Fifths            Twentieths

                    SAT-Verbal

a                   1.184a - .047     [illegible]
b                    .965b + .042     1.018b + .028
c (estimated)        .935c + .015     [illegible]
c (common)           c + .006         c [illegible] .001

                    SAT-Mathematical

Four-choice items:
a                   1.149a - .065     1.051a - .038
b                   1.054b + .021     1.062b + .042
c (estimated)       1.046c - .008     1.080c - .007
c (common)           c - .034          c - .001

Five-choice items:
a                   1.079a - .011     1.038a - .031
b                   1.021b + .034     1.039b + .028
c (estimated)       1.037c - .012     1.065c - .009
c (common)           c - .031          c - .023
Table 5

Scaling Parameters for SAT-Verbal and
SAT-Mathematical Forms^a

                    Verbal                  Mathematical
              -------------------       -------------------
SAT Form        A           B             A           B

V4            6.9931     193.2421       8.8839     272.5925
X2            6.9304     193.0270       8.6393     265.4098
Y3            6.8813     189.1449       8.4892     260.3593
B3            6.8315     183.8779       8.5734     267.4198
Y2            7.2588     184.5344       8.5533     269.7831
Z5            6.9440     200.4501       8.4740     273.1202

^a The 200 to 800 College Board scaled score (S) is determined
by the formula S = AX + B, where X is the raw score (number
right corrected for guessing).
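
The footnote's linear conversion can be illustrated with a short Python sketch. It is a sketch only: the function name and the raw score used in the example are hypothetical, the Form V4 slopes and intercepts are the values printed in Table 5, and any rounding or truncation to the 200 to 800 reporting range is deliberately left out because the footnote does not describe it.

```python
# Form V4 scaling parameters (A, B) as printed in Table 5.
V4_SCALING = {
    "verbal": (6.9931, 193.2421),
    "mathematical": (8.8839, 272.5925),
}


def to_scaled(raw, section):
    """Apply S = A*X + B for Form V4; raw is the formula (corrected) score."""
    slope, intercept = V4_SCALING[section]
    return slope * raw + intercept


# Hypothetical raw score of 40 on the verbal section.
example = to_scaled(40, "verbal")
```

In the chain equating described in the text, the input to this conversion would be the equated raw score at the end of the chain, so that the result is the "final" scaled score.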
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Table 6 



Transformations for Equating Item Difficulties Estimated from Different
Item Calibrations Using the Computer Program Quantile^a

Scale                        Uncorrected                         Corrected
Relationship          Fifths           Twentieths          Fifths           Twentieths

SAT-Verbal
X2 to V4              …                .944b - .055        .942b - .044     .950b - …
Y3 to V4              1.013b - .326    1.024b - .319       1.017b - .311    1.033b - …
B3 to V4              .916b - .068     .915b - .073        .915b - .054     .921b - …
Y2 to V4              .925b - .081     .926b - .085        .924b - .073     .928b - …
Z5 to V4              1.011b - .214    1.02?b - .218       1.016b - .202    1.034b - …
V4 to V4              .994b - .018     .991b - .014        .991b - .019     .991b - …

SAT-Mathematical
X2 to V4              .986b + .064     .997b + .060        Not Determined
Y3 to V4              1.127b - .243    1.121b - .231       Not Determined
B3 to V4              .985b + .109     .999b + .100        Not Determined
Y2 to V4              …39b + .058      1.073b + .042       Not Determined
Z5 to V4              …97b - .233      1.099b - .231       Not Determined
V4 to V4              1.010b - .016    1.015b - .017       Not Determined

^a The formula for transforming item discrimination parameters to the scale for Form V4 is a/A,
where A is the slope parameter given in the table.
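
As a rough illustration of how one of these transformations would be applied, the sketch below rescales a single item. The difficulty is transformed as Ab + B and the discrimination as a/A, as the footnote states. Leaving c unchanged is an assumption here (the table does not mention c), and the function name and the example item are hypothetical.

```python
def rescale_item(a, b, c, slope, intercept):
    """Place one item's (a, b, c) estimates on the Form V4 scale."""
    b_new = slope * b + intercept  # difficulty: b* = A*b + B
    a_new = a / slope              # discrimination: a* = a / A (table footnote)
    return a_new, b_new, c         # c left unchanged: an assumption, not stated in the table


# Hypothetical item; slope and intercept taken from the SAT-Mathematical
# X2 row, uncorrected fifths (.986b + .064).
a_new, b_new, c_new = rescale_item(a=1.0, b=0.5, c=0.2, slope=0.986, intercept=0.064)
print(round(a_new, 3), round(b_new, 3), c_new)  # 1.014 0.557 0.2
```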
Table 7

Comparison of Quantile Results to True Values and LOGIST Results for Artificial Data

                                              Relative to True Values       Relative to LOGIST Results^b
                                                          Quantiles                 Quantiles
                                   True       LOGIST
                                   Values     Results     5ths      20ths        5ths      20ths

No. of items = 43

Mean Absolute Difference
between Item Response Functions^a             .0177       .0173     .0171        .0169     .0131

a Parameter
  Mean                             .917       .975        .821      .879
  Standard Deviation               .328       .362        .306      .308
  Mean Absolute Difference                    .124        .156      .124         .183      .122
  Root Mean Squared Error                     .149        .203      .161         .218      .140
  Mean Difference                             .058        .096      .037         .154      -.096
  SD of Difference                            .138        .181      .158         .157      .104
  Correlation                                 .924        .839      .877         .904      .965

b Parameter
  Mean                             .202       .201        .100      .156
  Standard Deviation               .987       .993        .984      .907
  Mean Absolute Difference                    .106        .153      .144         .128      .118
  Root Mean Squared Error                     .143        .238      .196         .209      .160
  Mean Difference                             .001        .013      .046         .012      .046
  SD of Difference                            .144        .240      .192         .211      .155
  Correlation                                 .989        .970      .983         .977      .991

c Parameter
  Mean                             .195       .195        .175      .178
  Standard Deviation               .053       .061        .092      .086
  Mean Absolute Difference                    .035        .062      .055         .048      .040
  Root Mean Squared Error                     .043        .082      .076         .073      .060
  Mean Difference                             .000        .020      .017         .020      .017
  SD of Difference                            .044        .081      .074         .071      .059
  Correlation                                 .716        .482      .512         .639      .732

^a This is the mean absolute difference between the item response functions averaged over all of the abilities in the
criterion group and then averaged over all of the items.

^b This includes the c's fixed at the common c value.
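
The index described in the first footnote (absolute difference between two item response functions, averaged over abilities and then over items) can be sketched as follows. This is a minimal illustration under stated assumptions: the three-parameter logistic form with scaling constant D = 1.7 is assumed rather than taken from the table, abilities are weighted equally, and all names are hypothetical.

```python
import math


def irf(theta, a, b, c, D=1.7):
    """Three-parameter logistic item response function (D = 1.7 assumed)."""
    return c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))


def mean_abs_irf_difference(items_1, items_2, abilities):
    """Average |P1 - P2| over abilities within each item, then over items.

    items_1 and items_2 are parallel lists of (a, b, c) tuples for the
    same items under two calibrations.
    """
    per_item = []
    for p1, p2 in zip(items_1, items_2):
        diffs = [abs(irf(t, *p1) - irf(t, *p2)) for t in abilities]
        per_item.append(sum(diffs) / len(diffs))
    return sum(per_item) / len(per_item)


# Hypothetical example: one item whose c estimate differs by .2, one ability.
print(mean_abs_irf_difference([(1.0, 0.0, 0.0)], [(1.0, 0.0, 0.2)], [0.0]))
```

Because the index compares whole response functions over the ability range of the criterion group, offsetting errors in a, b, and c can yield a small value even when the individual parameter estimates differ noticeably.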
Table 8

Comparison of Quantile Results, without Corrections and with Corrections, to LOGIST Results for Form V4

Relative to LOGIST Results

                                                  V4fe                                             V4et
                                         w/o Corrections   w/ Corrections             w/o Corrections   w/ Corrections
                               LOGIST    5ths     20ths    5ths     20ths    LOGIST    5ths     20ths    5ths     20ths

No. of items = 90

Mean Absolute Difference
between Item Response
Functions^a                              .0090    .0056    .0101    .0074              .0090    .0085    .0122    .0086

a Parameter
  Mean                         .782      .748     .776     .839     .823     .762      .757     .785     .850     .633
  Standard Deviation           .274      .289     .266     .342     .311     .295      .278     .27?     .329     .321
  Mean Absolute Difference               .09?     .048     .095     .065               .092     .057     .110     .078
  Root Mean Squared Error                .14?     .084     .180     .107               .143     .082     .181     .192
  Mean Difference                        -.034    -.006    .057     .041               -.004    .023     .088     .072
  SD of Difference                       .146     .085     .172     .099               .144     .079     .159     .066
  Correlation                            .866     .951     .866     .951               .876     .964     .876     .964

b Parameter
  Mean                         .304      .277     .256     .309     .288     .326      .303     .276     .335     .309
  Standard Deviation           1.313     1.363    1.330    1.315    1.354    1.387     1.374    1.344    1.325    1.366
  Mean Absolute Difference               .101     .080     .080     .076               .090     .093     .089     .078
  Root Mean Squared Error                .171     .151     .159     .150               .134     .128     .143     .114
  Mean Difference                        -.027    -.049    .005     -.016              -.023    -.050    .009     -.017
  SD of Difference                       .170     .144     .159     .150               .133     .119     .144     .113
  Correlation                            .993     .994     .993     .994               .995     .997     .995     .997

c Parameter
  Mean                         .153      .148     .140     .153     .151     .156      .156     .145     .161     .156
  Standard Deviation           .060      .053     .055     .050     .056     .050      .056     .063     .053     .064
  Mean Absolute Difference               .023     .024     .020     .022               .020     .028     .018     .026
  Root Mean Squared Error                .051     .050     .050     .049               .031     .040     .030     .039
  Mean Difference                        -.005    -.013    .004     -.002              -.008    -.011    .000     -.001
  SD of Difference                       .051     .049     .050     .049               .031     .039     .029     .039
  Correlation                            .596     .645     .596     .645               .837     .792     .837     .792

^a This is the mean absolute difference between item response functions averaged over all of the abilities in the
criterion group and then averaged over all of the items.

^b This includes the c's fixed at the common c value.
Table 9

Initial and Final Transformations of SAT-Verbal Form V4
Raw Scores to Scaled Scores for Chain Equating

Final Scaled Score (Chain Equating)

Approximate Three-Parameter

Raw
Score Freq^a





3 




8 




11 


o7 


1 


86 


26 


K% a« 


44 


P>' 


1 


K • OJ 


98 




4A 

JO 




1 

135 


K ou 


172 




4 IK 

215 


K . /a 


251 


m'-i 


1 A X 

164 


7& 

K , /D 


294 


W^' 7« 


4A A 

360 


■ /a 


aai 


■ 74 


533 




374 


a * 71 


590 


70 


667 


t. 69 


791 


1 


913 


1 67 


650 


66 


1002 


65 


1166 


64 


1313 


63 


1486 


62 


1022 


61 


1562 


60 


1801 


59 


1955 


58 


2182 


57 


1633 


56 


2228 


55 


2583 


54 


2694 


53 


2987 


52 


2202 


51 


3133 


50 


3474 


1 ^ 




1 EKIC 





Initial
Scaled
Score

822.62
815.63
808.63
801.64
794.65
787.66
780.66
773.67
766.68
759.68
752.69
745.70
738.70
731.71
724.72
717.72
710.73
703.74
696.75
689.75
682.76
675.77
668.77
661.78
654.79
647.79
640.80
633.81
626.81
619.82
612.83
605.84
598.84
591.85
584.86
577.86
570.87
563.88
556.88
549.89
542.90



Concurrent    BICAL    Fifths w/o Corr    Fifths w/ Corr    Twentieths w/o Corr    Twentieths w/ Corr


822.62 


822.62 


822.62 


822.62 


822.62 


892 69 


821.64 


816.23 


816.90 


817.06 


815.90 


811 78 


818.56 


809.81 


810.65 


810.92 


808.25 




814.25 


803.37 


804.17 


S04.53 


OVA. J7 


802 41 


809.20 


796.90 


797.54 


797.^8 


79s 99 


791 71 


803.69 


790.39 


790.81 


791.31 


7119 7% 


789 .07 
/ .v# 


797.89 


783.87 


783.99 


784.56 


#o< .J X 


982.17 


791.90 


777.32 


777.11 


777.73 


975.78 


771 6A 


785.78 


770.75 


770.18 


770.85 


7a!9 01 


^788 88 


779.55 


764.16 


763.22 


763.93 




762.09 


773.20 


757.55 


736.24 


756.99 


79s.aii 


711.97 


766.75 


750.92 


749.25 


750.03 


768. SS 

f ^m%99 




760.18 


744.27 


742.24 


743.06 


'761*67 


741 ^11 


753.51 


737.60 


735.23 


736.07 


716 77 


714 69 


746.75 


730.91 


728.21 


729.07 


7^7.86 


727.68 


739.88 


724.21 


721.18 


722.06 


720.88 


790 70 


732.92 


717.50 


714.14 


715.04 


711 90 


711 7l 
# ft 


725.87 


710.76 


707.10 


7M.00 


7M 90 


708 70 


718.75 


704.02 


700.05 


700.95 


699 88 

077.QV 


690 66 


711.54 


697.26 


692 .99 


691.118 


692 81 


692 60 
07*. Dv 


704.30 


690.49 


685.91 




68S 77 


681 11 


696.99 


683.70 


678.83 


679.70 


878.68 


678.41 


689.63 


676.90 


671.73 


672.58 


671.18 


671 .98 


682.24 


670.09 


664.62 


665.45 


664.46 


664.14 


674.81 


663.27 


657.50 


658.31 


657.31 


656.97 


667.34 


656.44 


650.37 


651.15 


650.15 


649.79 


659.8; 


649 60 


643.23 


643.97 


642.98 


642.59 


652.37 


642.74 


636.08 


636.78 


C-3.79 


635.36 


644.86 


635.88 


628.91 


629.58 


628.58 


628.12 


637.34 


629.00 


621.74 


6^2.37 


621.36 


620.87 


629.81 


622.11 


614.56 


613.14 


614.13 


613.60 


622.26 


615.22 


607.37 


607.90 


606.88 


606.33 


614.73 


608.32 


600.16 


600.65 


599.63 


599.04 


607.19 


601.41 


592.95 


593.19 


592.37 


591.74 


599.68 


594.50 


585.73 


586.11 


585.11 


584.44 


592.20 


587.57 


578.50 


57^.82 


577.85 


577.14 


584.70 


580.63 


571.27 


571.52 


570.59 


569.85 


577.24 


573.69 


564.03 


564.21 


563.33 


562.56 


569.80 


566.74 


556.78 


556.90 


556.08 


555.28 


562.39 


559.79 


549.54 


549.58 


548.83 


548.00 


555.01 


552.83 


542.29 


542.26 


541.59 


540.75 



Mwitt 
MM 



14 



««». 
•5.1.4 

6>2*^ 
625.4 
618.4 
611.j 
6M*| 
3974 
5».l 

575. 
568. . 
561. 2il 
Table 9 (Continued)

Final Scaled Score (Chain Equating)



3653 
3996 
3011 
4194 
4416 
4729 
4646 
3793 
5032 
5449 
5492 
5635 
4420 
5794 
5942 
5933 
6072 
4537 
5860 
5996 
6005 
5990 
4521 
5359 
5381 
5325 
5002 
3719 
4666 
4491 
4356 
4086 
2933 
3472 
3320 
3112 
2836 
1967 
2315 
2209 
1967 
1754 
1120 
1290 



Initial
Scaled
Score

535.90
528.91
521.92
514.92
507.93
500.94
493.95
486.95
479.96
472.97
465.97
458.98
451.99
444.99
438.00
431.01
424.01
417.02
410.03
403.04
396.04
389.05
382.06
375.06
368.07
361.08
354.08
347.09
340.10
333.10
326.11
319.12
312.12
305.13
298.14
291.15
284.15
277.16
270.17
263.17
256.18
249.19
242.19
235.20
228.21



Fifths

Twentieths

Concurrent

547.67 
540.36 
533.09 
525.69 
518.73 
511.59 
504.49 
497.44 
490.42 
483.44 
476.46 
469.56 
462.64 
455.74 
448.85 
441.95 
435.04 
428.11 
421.16 
414.16 
407.14 
400.05 
392.93 
385.76 
378.54 
371.25 
363.92 
356.55 
349.10 
341.60 
334.05 
326.44 
318.77 
311.05 
303.29 
295.48 
287.65 
279.80 
271.98 
264.18 
256.45 
248.76 
241.23 
233.61 
226.54 



BICAL 

545.65 
538.66 
531.90 
524.91 
517.91 
510.91 
503.91 
496.90 
489.68 
482.86 
475.63 
466.60 
461.77 
454.73 
447.66 
440.63 
433.56 
426.52 
419.46 
412.39 
405.32 
398.25 
391. !7 
384.09 
377.01 
369.92 
362.83 
355.74 
348.64 
341.54 
334.43 
327.33 
320.22 
313.10 
305.96 
296.67 
291.74 
284.62 
277.49 
270.35 
263.22 
256.08 
248.94 
241.79 
234.64 



w/o Corr

535.04 
527.60 
520.56 
513.34 
506.12 
496.92 
491.74 
464.56 
477.44 
470.33 
463.24 
456.16 
449.15 
442.14 
435.16 
428.22 
421.30 
414.41 
407.56 
400.72 
393.92 
387.14 
380.38 
373.65 
366.94 
360.25 
353.57 
346.91 
340.27 
333.63 
327.01 
320.40 
313.79 
307.19 
300.56 
293.98 
287.37 
280.76 
274.14 
267.52 
260.68 
254.23 
247.56 
240.67 
234.16 



w/ Corr w/o Corr



534.95 

527.63 
520.33 
513.04 
505.77. 
496.51 
491.26 
464.08 
476.90 
469.76 
462.64 
455.57 
446.52 
441.51 
434.54 
427.60 
420.69 
413.82 
406.96 
400.17 
393.40 
366.65 
379.92 
373.22 
366.54 
359.67 
353.23 
346.60 
339.96 
333.37 
326.76 
320.16 
313.59 
307.00 
300.41 
293.62 
267.22 
280.61 
274.00 
267.37 
260.72 
254.06 
247.36 
240.66 
233.95 



534.37 
527.16 
519.97 
5U.60 
505.65 
496.52 
491.41 
464.32 
477.26 
470.23 
463.21 
456.22 
449.25 
442.31 
435.36 
4V.46 
421.60 
414.73 
407.69 
401.06 
394.24 
367.45 
360.66 
373.69 
367.14 
360.39 
353.66 
346.93 
340.22 
333.51 
326.62 
320.13 
313.45 
306.77 
300.10 
293.43 
^66.76 
280.09 
273.42 
266.74 
260.06 
253.36 
246.65 
239.92 
233.17 



w/ Corr

533.51 

526.29 

519.09 

511.92 

504.77 

497.66 

490.56 

463.50 

476.47 

469.47 

662.49^ 

455.54 

448.62 

441.72 

414.64 

427.99 

421.16 

414.35 

407.55 

400.78 

394.02 

367.27 

360.54 

373.62 

367.12 

360.42 

353.74 

347.06 

340.39 

333.73 

327.06 

320.44 

313.80 

307.17 

300.54 

293.91 

267.28 

260.65 

274.01 

267.36 

260.69 

254.01 

247.31 

240.56 

233.63 



Modified
Three-Parameter



554.0$ 

546.6$^ 

)JI«61 

53a.»^ ^ 
52S.ir 

5MI.62; 
50^11:^ 
<96.0fr^ 

mm} 

459.11 

45i;jt* 
a5idi 

437.69 

430.36^ 

423.03 

415.7Q 

406.36 

401.01 

393.66 

366.29 

376.93 

371.56 

364.16 

356.60 

349.42 

342.02 

334.63 

327.23 

319.63 

312.43 

305.03 

297.63 

290.24 

262.65 

275.47 

266.09 

260.73 

253.37 

246.03 

236.71 

231.40 



Table 9 (Continued)

Final Scaled Score (Chain Equating)

Approximate Three-Parameter

            Initial
            Scaled
Freq^a      Score

1057        221.21
 853        214.22
 484        207.23
 522        200.24
 484        193.24
 358        186.25
 265        179.26
 112        172.26
 129        165.27
  91        158.28
  52        151.28
  30        144.29
   6        137.30
  14        130.30
   6        123.31
   3        116.32
   2        109.32
   0        102.33
   1         95.34
   0         88.35
   0         81.35
   0         74.36
   0         67.37
   0         60.37
   0         53.38
   0         46.39
   0         39.39



Fifths

Twentieths

Concurrent

219.44 
212.52 
205.78 
199.22 
192.84 
186.63 
180.59 
174.70 
168.97 
163.67 
156.46 
149.24 
142.01 
134.79 
127.56 
120.34 
113.11 
105.89 

98.66 

91.44 

84.21 

76.99 

69.76 

62.54 

55.31 

48.09 

40.86 



BICAL 

227 .49 
220.33 
213.17 
206.01 
198.84 
191.67 
184.49 
177.31 
170.13 
162.94 
155.74 
148.54 
141.34 
134.12 
126 .91 
119.68 
112.45 
105.21 
97.97 
90.71 
83.45 
76.17 
68.88 
61.58 
54.26 
46.93 
39.39 



w/o Corr 

227.42 
220.65 
213.85 
207.01 
200.12 
193.19 
186.20 
179.15 
172.03 
164.82 
157.36 
150.23 
143.09 
135.96 
128.82 
121.69 
114.55 
107.42 
100.28 
93.15 
86.01 
78.88 
71.74 
64.61 
57.47 
50.34 
43.20 



w/ Corr w/o Corr



227.20 
220.41 
213.59 
206.71 
199.80 
192.82 
185.79 
178.69 
171.53 
164c21 
15^.39 
149.77 
142.65 
135.53 
128.41 
121.29 
114.17 
107.05 
99.93 
92.81 
85.69 
78.57 
71.45 
64.33 
57.21 
50.09 
42.97 



226.39 
219.59 
212.75 
205.87 
198.94 
191.9^ 
184.91 
177.80 
170.61 
163.31 
155.87 
148.23 
141.16 
134.09 
127 .02 
119.95 
112.88 
105.81 
98.74 
91.67 
84.59 
77.52 
70.45 
63.38 
56.31 
49.24 
42.17 



w/ Corr

227.04 
220.20 
213.32 
206.39 
199.19 
192.32 
185.16 
177.90 
170.52 
162.93 
155.29 
148.23 
141.16 
134.10 
127.03 
119.97 
112.91 
105.84 
98.78 
91.71 
84.65 
77.59 
70.52 
63.46 
56.39 
49.33 
42.27 



Modified
Three-Parameter

2UAI 
2li*i» ' 

202.H 

195412 ^ 

MM 

IM.n 

173»5t 

166»44> 

1S9.J6 

ia.5i 

145.50 
136.47 
131.45 
124.42 
117.59 
110.37 
103.54 

56.51 

•1.29 

82.26 

75.24 

68.21 

61.18 

54.16 

47.13 

40.10 



^a SAT-verbal Form V4 raw score frequency distribution for initial December 1973 administration.
Table 10

Initial and Final Transformations of SAT-Mathematical Form V4 Raw Scores to Scaled Scores for Chain Equating

Final Scaled Score (Chain Equating)

Raw
Score Freq^a



60 
59 
58 
57 
56 
55 
54 
53 
52 
51 
50 
49 
48 
47 
46 
45 
44 
43 
42 
41 
40 
39 
38 
37 
36 
35 
34 
33 
32 
31 
30 
29 
28 
27 
26 
25 
24 
23 
22 
21 
20 



34 
114 
205 
47 
317 
475 
638 
795 
430 
878 
1094 
1312 
1555 
1H4 
1647 
1939 
2235 
245? 
2117 
2588 
2859 
3213 
3560 
3059 
3838 
4022 
4385 
4761 
3817 
5053 
5248 
5511 
5870 
4779 
6034 
6466 
6576 
6894 
5470 
6934 
6844 



Initial
Scaled
Score

805.63
796.74
787.86
778.97
770.09
761.21
752.32
743.44
734.56
725.67
716.79
707.90
699.02
690.14
681.25
672.37
663.48
654.60
645.72
636.83
627.95
619.06
610.18
601.30
592.41
583.53
574.65
565.76
556.88
547.99
539.11
530.23
521.34
512.46
503.57
494.69
485.81
476.92
468.04
459.15
450.27



Concurrent

805.63 
799.30 
791.93 
784.12 
776.15 
768.13 
760.07 
751.93 
743.70 
735.35 
726.G6 
718.26 
709.52 
700.66 
691.67 
682.57 
673.37 
664.06 
654.68 
645.24 
635.74 
626.23 
616.68 
607.14 
597.61 
588.10 
578.64 
569.23 
559.89 
550.62 
541.45 
532.36 
523.36 
514.46 
505.63 
496.88 
488.19 
479.55 
470.95 
462.36 
453.78 



BICAL

805.63 
796.88 
788.07 
779.22 
770.34 
761.44 
752.52 
743.59 
734.66 
725.73 
716.79 
707.85 
698.91 
689.97 
681.04 
672.11 
663.18 
654.25 
645.32 
636.39 
627.47 
618.54 
609.63 
600.71 
591.79 
582.88 
573.97 
565.06 
556.15 
547.25 
538.35 
529.45' 
520.56 
511.66 
502.77 
493.89 
485.00 
476.12 
467.24 
458.36 
449.48 



Fifths^b

805.63 
796.06 
787.01 
776.16 
769.44 
760.79 
752*19 
743.61 
735.03 
726.44 
717.83 
709.20 
700.53 
691.83 
683.08 
674.30 
665.47 
656.59 
647.67 
638.72 
629.71 
620.67 
611.59 
602.48 
593.35 
584.20 
575.04 
565.88 
556.72 
547 .57 
538.44 
529.34 
520.26 
511.20 
502.19 
493.20 
484.24 
475.32 
466.43 
457.56 
448.72 



Twentieths^b

805.63 
797.41 
788.92 
780.41 
771.90 
763.36 
754.82 
746.23 
737.61 
728.94 
720.23 
711.46 
702.66 
693.80 
684.89 
675.94 
666.93 
657.38 
648.79 
639.65 
630.47 
621.25 
612.01 
602.74 
593.45 
584.16 
574.87 
565.59 
556.33 
547.10 
537.90 
528.73 
519.61 
510.54 
501.51 
492.52 
483.58 
474.67 
465.80 
456.96 
448.16 



Modified
Three-
Parameter

78».85 

781.11 

ifi.n 

763«3« 
7S4«S7 

737.68 ^ 

7».15 

7^.64 

712.14 

703.64 

695.13 

686.Sf 

678.02 

669.40 

660.73 

652.02 

643.26 

634.45 

6;3.60 

616.71 

607 .7f 

598.83 

589.85 

580.86 

571.85 

562.84 

553.83 

544.81 

535.81 

526.83 

517.84 

508.88 

499.93 

490.99 

482.07 

473.16 

464.27 

455«39 



Table 10 (Continued)

Final Scaled Score (Chain Equating)

                    Initial                             Approximate             Modified
 Raw                Scaled                            Three-Parameter           Three-
Score    Freq^a     Score    Concurrent    BICAL    Fifths^b   Twentieths^b     Parameter

  19     7186      441.39     445.20      440.60    439.89      439.38           …
  18     7206      432.50     436.59      431.73    431.09      430.63           …
  17     5444      423.62     427.95      …         422.50      421.89           …
  16     6909      414.73     419.27      413.98    413.53      413.17           …
  15     6986      405.85     410.54      405.11    404.77      404.47           …
  14     7014      396.97     401.74      396.24    396.02      395.79           …
  13     6696      388.08     392.88      …         387.28      387.11           …
  12     5006      379.20     383.95      …         378.55      378.45           …
  11     6025      370.32     374.92      369.63    369.83      369.80           …
  10     5948      361.43     365.79      …         361.13      361.16           …
   9     5588      352.55     356.53      351.90    352.43      352.54           …
   8     5411      343.66     347.13      343.03    343.75      343.93           …
   7     3576      334.78     337.57      334.16    335.08      335.34           …
   6     4577      325.90     327.82      325.30    326.42      326.77           …
   5     5         317.01     317.87      316.43    317.78      318.23           …
   4     3?49      308.13     307.71      307.57    309.15      309.70           …
   3     3405      299.24     297.38      298.70    300.53      301.21           303.78
   2     2065      290.36     286.91      289.84    291.93      292.74           294.60
   1     2448      281.48     276.39      280.98    283.34      284.31           …
   0     2049      272.59     265.95      272.13    274.78      275.91           …
  -1     1621      263.71     255.74      263.27    266.24      267.57           …
  -2     1152      254.82     245.88      254.41    257.74      259.30           …
  -3      456      245.94     236.44      245.56    249.26      251.09           …
  -4      500      237.06     227.18      236.71    240.78      242.96           …
  -5      324      228.17     216.09      227.86    232.17      234.86           …
  -6      186      219.29     211.51      219.01    222.96      226.30           217.56
  -7       97      210.41     203.07      210.16    213.91      217.06           208.69
  -8       17      201.52     194.64      201.32    204.87      207.?3           201.70
  -9       17      192.64     186.20      192.47    195.82      198.59           192.80
 -10        4      183.75     177.76      183.62    186.78      189.35           183.91
 -11        0      174.87     169.33      174.78    177.73      180.11           175.02
 -12        0      165.99     160.89      165.93    168.69      170.87           166.12
 -13        0      157.10     152.46      157.07    159.64      161.64           157.23
 -14        0      148.22     144.02      148.21    150.60      152.40           148.33
 -15        0      139.33     135.58      139.33    141.55      143.16           139.44

^a SAT-mathematical Form V4 raw score frequency distribution for initial December 1973 administration.

^b No correction was used.
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Table 11 





Information and Summary Discrepancy Indices^a for
Item Response Theory Equating Models

                         Initial                                  Approximate Three-Parameter                    Modified
                         Scale                             Fifths      Fifths      Twentieths   Twentieths       Three-
Index                    (Criterion)  Concurrent   BICAL   w/o Corr.   w/ Corr.    w/o Corr.    w/ Corr.         Parameter

SAT-Verbal
Scaled Scores
  Mean                   435.37       445.73      444.45   434.93      434.67      434.75       434.45           448.72
  Standard Deviation     109.09       112.89      109.60   108.58      108.79      108.54       108.17           113.38
  Mean Squared Error^b                125.15       83.35     5.75        7.38        4.8?         7.01           198.51
  Mean Difference                      10.36        9.08     -.44        -.70        -.62         -.92            13.35
  SD of Difference                      4.23         .96     2.36        2.62        2.11         2.48             4.49

SAT-Mathematical
Scaled Scores
  Mean                   468.05       471.61      467.40   467.81      Not         467.82       Not              473.35
  Standard Deviation     113.25       115.50      113.32   113.43      Determined  113.59       Determined       113.63
  Mean Squared Error^b                 23.27         .46     1.61                    3.78                         28.67
  Mean Difference                       3.55        -.66     -.24                    -.23                          …
  SD of Difference                      3.26         .18     1.25                    1.93                          .81

^a Computed for SAT-verbal raw scores 1 through 80 and for SAT-mathematical raw scores -8 through 55.

^b (SD of Difference)^2 + (Mean Difference)^2
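
The mean squared error footnote, read here as (SD of Difference)^2 + (Mean Difference)^2, is the usual split of a mean squared difference into a variance part and a bias part. The sketch below is illustrative only: the function names are hypothetical, the differences are unweighted, and the population form of the standard deviation is assumed, none of which the table states.

```python
from statistics import fmean, pstdev


def mse_from_summary(mean_diff, sd_diff):
    """Mean squared error rebuilt from the two summary indices."""
    return sd_diff ** 2 + mean_diff ** 2


def discrepancy_indices(final_scores, criterion_scores):
    """Mean difference, SD of difference, and MSE for paired scaled scores."""
    diffs = [f - c for f, c in zip(final_scores, criterion_scores)]
    mean_diff = fmean(diffs)
    sd_diff = pstdev(diffs)  # population SD: an assumption, not stated in the table
    mse = fmean([d * d for d in diffs])
    return mean_diff, sd_diff, mse


# SAT-Verbal BICAL column: mean difference 9.08, SD of difference .96.
print(round(mse_from_summary(9.08, 0.96), 2))  # 83.37
```

The rebuilt value of 83.37 is close to the tabled 83.35; the small gap is what one would expect if the tabled summary statistics were rounded before printing.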
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Table 12 

Transformations for Equating Item Difficulties Estimated from
Different Item Calibrations Using BICAL (One-Parameter Model)
and LOGIST (Three-Parameter Model)

Scale                      One-Parameter                        Three-Parameter
Relationship        Common-Item    Score-Conversion      Common-Item      Score-Conversion

SAT-Verbal
X2 to V4            b ? .090       .877b - .175          …                …
Y3 to V4            b ? .174       .861b - .189          .957b - .297     .995b ? .312
B3 to V4            b ? .214       .851b - .177          .839b - .074     .897b ? .065
Y2 to V4            b ? .130       .906b - .166          .849b - .087     .910b ? .082
Z5 to V4            b ? .130       .876b - .065          .903b - .231     .997b ? .184
V4 to V4            b ? .122       1.014b - .046         .898b - .092     .979b + .002

SAT-Mathematical
X2 to V4            b ? .131       .996b - .159          .948b + .096     .993b + .066
Y3 to V4            b ? .226       .950b - .205          1.068b - .211    1.109b ? .215
B3 to V4            b ? .182       1.021b - .074         .942b + .064     1.005b + .106
Y2 to V4            b ? .189       .956b - .113          .959b + .031     1.048b + .058
Z5 to V4            b ? .116       .971b - .101          .9??b - .259     1.051b ? .224
V4 to V4            b ? .014       1.010b - .013         .998b - .083     1.010b ? .027

Table 13

Information and Summary Discrepancy Indices^a for
Selected Item Response Theory Equating Models



Index 


Initial Scale 
(Criterion) 


Concurrent 


BICAL 


Characteristic Curve LOGIST 
Transformation (Modified) 


(Modified) 






SAT-Verbal








Scaled Score: 














Mean 


435.37 


445.73 


444.45 


446.34 


434.50 


434.33 


Standard Deviation


109.09 


112.89 


109.60 


116.47 


106.61 


108.33 


Mean Squared Error^b




125.15 


83.35 


178.98 


6.99 


1.69 


Mean Difference 




10.36 


9.08 


10.97 


-.87 


-1.04 


S.D. of Difference 




4.23 


.96 


7.66 


2.50 


.79 


SAT-Mathematical 


Scaled Score: 














Mean 


468.05 


471.61 


467.40 


474.95 


468.49 


467.39 


Standard Deviation 


113.25 


115.50 


113.32 


114.82 


112.87 


112.36


Mean Squared Error 




23.27 


.46 


61.08 


7.28 


1.31 


Mean Difference 




3.55 


-.66 


6.90 


.44 


-.66 


S.D. of Difference




3.26 


.13 


3.67 


2.66 


.93 



^a Computed for SAT-verbal raw scores 1 through 80 and for SAT-mathematical raw scores -8 through 55.

^b (SD of Difference)^2 + (Mean Difference)^2
Six Verbal Equatings

V4 -> fe -> X2 -> fm -> Y3 -> fw -> B3 -> fk -> Y2 -> fu -> Z5 -> et -> V4

Six Mathematical Equatings

V4 -> ff -> X2 -> fn -> Y3 -> fx -> B3 -> fl -> Y2 -> fv -> Z5 -> eu -> V4

Figure 1. Verbal and mathematical equating chains.
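
The logic of equating a test to itself through a chain can be sketched with linear links. This is a deliberate simplification: the equatings in the study are IRT-based and need not be linear, and the link coefficients below are made up purely for illustration. If every link were error free, composing the six links around the loop would return the identity transformation, so any departure from slope 1 and intercept 0 reflects accumulated equating error.

```python
def compose(links):
    """Compose linear links y = a*x + b, applied in the order given."""
    slope, intercept = 1.0, 0.0
    for a, b in links:
        # Apply (a, b) after the transformation accumulated so far.
        slope, intercept = a * slope, a * intercept + b
    return slope, intercept


# Hypothetical links for V4 -> X2 -> Y3 -> B3 -> Y2 -> Z5 -> V4.
links = [(1.01, -0.5), (0.99, 0.3), (1.02, -0.2), (0.98, 0.4), (1.00, 0.1), (1.00, -0.1)]
slope, intercept = compose(links)

raw = 45.0
drift = slope * raw + intercept - raw  # departure from the identity at one score
print(round(slope, 4), round(intercept, 4), round(drift, 4))
```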
Verbal Data Sets        Mathematical Data Sets

V4  fe                  V4  ff
fe  X2                  ff  X2
X2  fm                  X2  fn
fm  Y3                  fn  Y3
Y3  fw                  Y3  fx
fw  B3                  fx  B3
B3  fk                  B3  fl
fk  Y2                  fl  Y2
Y2  fu                  Y2  fv
fu  Z5                  fv  Z5
Z5  et                  Z5  eu
et  V4                  eu  V4

Figure 2. Verbal and mathematical data sets. Each box represents
a sample of approximately 2,670 cases.
Legend

* - one or both c's were fixed
at the common c value

♦ or # - neither c was fixed at the
common c value

Figure 3. Comparison of Quantile parameter estimates for fifths to
true values. Artificial data.

Legend

* - one or both c's were fixed
at the common c value

♦ or # - neither c was fixed at the
common c value

Figure 4. Comparison of Quantile parameter estimates for twentieths
to true values. Artificial data.

Figure 5. Comparison of Quantile parameter estimates for fifths
to LOGIST parameter estimates. Artificial data.

Legend

* - one or both c's were fixed
at the common c value

♦ or # - neither c was fixed at the
common c value

Figure 6. Comparison of Quantile parameter estimates for twentieths
to LOGIST parameter estimates. Artificial data.





Legend

Item response function using the true values
of the parameters

Item response function using the Quantile estimated
parameters for the 5ths grouping

O - Observed proportion correct for the true abilities
grouped into intervals of .4. The size is proportional
to the number of abilities

O - Observed proportion correct for the abilities
estimated by the Quantile program

Figure 7. Item ability regressions for true values and
for Quantile parameter estimates for fifths.

Figure 8. Comparison of Quantile parameter estimates for the
fifths to LOGIST parameter estimates for form V4FE.
Legend

* - one or both c's were fixed
at the common c value

♦ or # - neither c was fixed at the
common c value

Figure 9. Comparison of Quantile parameter estimates corrected for
bias for fifths to LOGIST parameter estimates for form
V4FE.

Figure 10. Comparison of Quantile parameter estimates for
twentieths to LOGIST parameter estimates for
form V4FE.

Figure 11. Comparison of Quantile parameter estimates corrected
for bias for twentieths to LOGIST parameter estimates
for form V4FE.
Figure 12. Comparison of Quantile parameter estimates for fifths
to LOGIST parameter estimates for form V4ET.

Figure 13. Comparison of Quantile parameter estimates corrected for
bias for fifths to LOGIST parameter estimates for form V4ET.

Figure 14. Comparison of Quantile parameter estimates for
twentieths to LOGIST parameter estimates for
form V4ET.

Legend

* - one or both c's were fixed
at the common c value

♦ or # - neither c was fixed at the
common c value

Figure 15. Comparison of Quantile parameter estimates corrected for
bias for twentieths to LOGIST parameter estimates for
form V4ET.
FIGURE 18
FIGURE 20
FIGURE 22